nchypergeom_wallenius (Wallenius’ noncentral hypergeometric distribution)#

Wallenius’ noncentral hypergeometric distribution models biased sampling without replacement from a finite population with two types of objects.

This notebook uses the same parameterization as scipy.stats.nchypergeom_wallenius:

  • M = total population size (integer)

  • n = number of Type I objects in the population (integer)

  • N = number of draws (integer)

  • odds = odds ratio (\omega>0) (real)

Learning goals#

By the end you should be able to:

  • recognize when Wallenius’ noncentral hypergeometric model is appropriate

  • write down the PMF/CDF and understand the sampling mechanism

  • compute mean/variance/skewness/kurtosis and entropy numerically

  • implement NumPy-only sampling (sequential biased draws)

  • visualize PMF/CDF and validate with Monte Carlo simulation

  • use scipy.stats.nchypergeom_wallenius for computation and inference workflows

Table of contents#

  1. Title & Classification

  2. Intuition & Motivation

  3. Formal Definition

  4. Moments & Properties

  5. Parameter Interpretation

  6. Derivations

  7. Sampling & Simulation

  8. Visualization

  9. SciPy Integration

  10. Statistical Use Cases

  11. Pitfalls

  12. Summary

import math

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=6, suppress=True)

rng = np.random.default_rng(42)

1) Title & Classification#

Name: nchypergeom_wallenius (Wallenius’ noncentral hypergeometric distribution)
Type: Discrete

We follow the SciPy naming/notation:

  • M = total population size

  • n = number of Type I objects (so Type II count is M-n)

  • N = number of draws (without replacement)

  • odds = (\omega), the odds ratio favoring Type I

Let (X) be the number of Type I objects drawn after N biased draws.

Support: [ x \in \{x_\ell, x_\ell+1, \dots, x_u\} ] where [ x_\ell = \max\bigl(0,\, N-(M-n)\bigr),\qquad x_u = \min(N, n). ]

Parameter space: [ M\in\mathbb{N},\qquad 0\le n\le M,\qquad 0\le N\le M,\qquad \omega>0. ]

Interpretation:

  • M, n describe the composition of the urn/population

  • N is how many draws you take

  • odds controls how strongly Type I is preferred ((\omega>1)) or discouraged ((\omega<1))

2) Intuition & Motivation#

The sampling story (biased urn)#

Imagine an urn with:

  • n red balls (Type I)

  • M-n blue balls (Type II)

You draw exactly N balls one-by-one without replacement, but the draw is biased:

  • if there are (r) red and (b) blue remaining, then

[ \mathbb{P}(\text{next draw is Type I}\mid r,b) = \frac{\omega r}{\omega r + b} ]

So (\omega) multiplies the attractiveness of Type I relative to Type II.

A key sanity check: if there is one ball of each color left, then [ \mathbb{P}(\text{Type I}) = \frac{\omega}{\omega+1}. ]

What this distribution models#

  • Preferential selection without replacement (finite population correction matters)

  • Biased sampling from two categories when the bias changes dynamically as items are removed

Typical real-world use cases#

  • Selection bias / preferential sampling: a subgroup is more likely to be selected, but you cannot select the same individual twice

  • Ecology / capture–recapture: heterogeneous catchability (Type I vs Type II) in a finite pool

  • Auditing / inspections: inspectors target one class of items more aggressively while sampling without replacement

  • Recommendation / ranking pipelines: drawing items sequentially with different propensities until a quota is met

Relations to other distributions#

  • Hypergeometric: when (\omega=1), the bias disappears and we recover the ordinary hypergeometric distribution.

  • Fisher’s noncentral hypergeometric (nchypergeom_fisher): another “noncentral hypergeometric” distribution with the same parameters but a different generative story.

    • Wallenius: biased sequential draws (odds applied at each draw)

    • Fisher: objects are taken independently in a single pass, and the analysis conditions on the total number drawn being N (rather than drawing sequentially until a quota is met)

  • Binomial approximation: if M is large and N is small (sampling fraction (N/M) is tiny), the changing composition is negligible and

[ X \approx \mathrm{Binomial}\bigl(N,\, p\bigr),\qquad p = \frac{\omega\,(n/M)}{\omega\,(n/M) + (1-n/M)} = \frac{\omega n}{\omega n + (M-n)}. ]

That is: for small sampling fractions, “biased without replacement” is close to “biased with replacement”.
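
As a quick numerical check of this approximation, the sketch below compares the Wallenius PMF with the binomial PMF at the biased single-draw probability p. The parameters (a large population with a tiny sampling fraction) are assumed purely for illustration.

from scipy.stats import binom, nchypergeom_wallenius

# Illustrative parameters (assumed): large M, small sampling fraction N/M.
M_big, n_big, N_small, w = 5000, 1500, 30, 2.0
p_binom = w * n_big / (w * n_big + (M_big - n_big))

xs_chk = np.arange(0, N_small + 1)
pmf_wall = nchypergeom_wallenius.pmf(xs_chk, M_big, n_big, N_small, w)
pmf_binom = binom.pmf(xs_chk, N_small, p_binom)

# Total variation distance between the two PMFs; expected to be small when N/M << 1.
float(0.5 * np.abs(pmf_wall - pmf_binom).sum())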

3) Formal Definition#

Generative definition (sequential without replacement)#

Let (R_0=n) and (B_0=M-n) be the initial counts (Type I and II remaining). For draws (t=1,2,\dots,N):

[ \mathbb{P}(\text{draw Type I at step }t\mid R_{t-1},B_{t-1}) = \frac{\omega R_{t-1}}{\omega R_{t-1}+B_{t-1}}. ]

After you draw a ball, you decrement the corresponding remaining count. Let (X) be the total number of Type I objects drawn in the N draws.

PMF (integral representation)#

There is no simple closed-form PMF in elementary functions. A standard integral form is

[ \mathbb{P}(X=x) = \binom{n}{x}\,\binom{M-n}{N-x} \int_0^1 \left(1-t^{\omega/D}\right)^x\left(1-t^{1/D}\right)^{N-x}\,dt, \qquad x\in\{x_\ell,\dots,x_u\} ]

where

[ D = \omega(n-x) + \bigl((M-n) - (N-x)\bigr) = \omega(n-x) + (M-n-N+x). ]
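
As a sanity check on this integral form, the following sketch evaluates it with scipy.integrate.quad and compares the result with scipy.stats.nchypergeom_wallenius.pmf. The helper wallenius_pmf_integral and the small parameter values are assumptions made for this illustration only.

from scipy.integrate import quad
from scipy.special import comb
from scipy.stats import nchypergeom_wallenius


def wallenius_pmf_integral(x, M, n, N, odds):
    # Quadrature evaluation of the integral representation above.
    D = odds * (n - x) + (M - n - N + x)

    def integrand(t):
        return (1.0 - t ** (odds / D)) ** x * (1.0 - t ** (1.0 / D)) ** (N - x)

    value, _ = quad(integrand, 0.0, 1.0)
    return comb(n, x, exact=True) * comb(M - n, N - x, exact=True) * value


# Small example (assumed) so the binomial coefficients stay modest.
M_i, n_i, N_i, w_i = 30, 12, 10, 1.5
xs_i = np.arange(max(0, N_i - (M_i - n_i)), min(N_i, n_i) + 1)
pmf_quad = np.array([wallenius_pmf_integral(int(x), M_i, n_i, N_i, w_i) for x in xs_i])
pmf_ref = nchypergeom_wallenius.pmf(xs_i, M_i, n_i, N_i, w_i)
float(np.max(np.abs(pmf_quad - pmf_ref)))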

CDF#

The CDF is defined by summing the PMF over the integer support:

[ F(k) = \mathbb{P}(X\le k) = \sum_{x=x_\ell}^{\lfloor k \rfloor} \mathbb{P}(X=x). ]

In practice (and in libraries), the PMF/CDF are computed using specialized numerical methods.

4) Moments & Properties#

Because the support is finite, all moments exist.

Mean, variance, skewness, kurtosis#

Given the PMF (p(x)=\mathbb{P}(X=x)), the standard definitions are:

[ \mu = \mathbb{E}[X] = \sum_x x\,p(x), \qquad \sigma^2 = \mathrm{Var}(X) = \sum_x (x-\mu)^2\,p(x). ]

Define central moments (\mu_r = \mathbb{E}[(X-\mu)^r]). Then

[ \text{skewness }\gamma_1 = \frac{\mu_3}{\sigma^3}, \qquad \text{excess kurtosis }\gamma_2 = \frac{\mu_4}{\sigma^4}-3. ]

For Wallenius’ distribution, these quantities generally have no simple closed form, but they are easy to compute numerically from the PMF.

MGF / characteristic function#

With the PMF on integer support,

[ M_X(t) = \mathbb{E}[e^{tX}] = \sum_x e^{tx}\,p(x), \qquad \varphi_X(t) = \mathbb{E}[e^{itX}] = \sum_x e^{itx}\,p(x). ]

Entropy#

The Shannon entropy (in nats) is

[ H(X) = -\sum_x p(x)\,\log p(x). ]

Symmetry by swapping types#

If you swap the meaning of Type I and Type II, the count transforms as (X \mapsto N-X), and the odds invert:

[ X\sim\text{Wallenius}(M,n,N,\omega) \quad\Longleftrightarrow\quad N-X\sim\text{Wallenius}(M,M-n,N,1/\omega). ]

(You can view this as “same biased sequential sampling, just relabeled”.)
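
A quick numerical check of this identity (the parameters below are assumed for illustration; nchypergeom_wallenius is also imported in a later cell):

from scipy.stats import nchypergeom_wallenius

# Assumed illustration parameters.
M_s, n_s, N_s, w_s = 50, 20, 15, 2.5
xs_s = np.arange(max(0, N_s - (M_s - n_s)), min(N_s, n_s) + 1)

pmf_lhs = nchypergeom_wallenius.pmf(xs_s, M_s, n_s, N_s, w_s)
pmf_rhs = nchypergeom_wallenius.pmf(N_s - xs_s, M_s, M_s - n_s, N_s, 1.0 / w_s)

# The two sides of the identity should agree up to numerical error.
float(np.max(np.abs(pmf_lhs - pmf_rhs)))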

def _validate_M_n_N_odds(M, n, N, odds):
    if isinstance(M, bool) or not isinstance(M, (int, np.integer)):
        raise TypeError("M must be an integer")
    if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
        raise TypeError("n must be an integer")
    if isinstance(N, bool) or not isinstance(N, (int, np.integer)):
        raise TypeError("N must be an integer")

    M = int(M)
    n = int(n)
    N = int(N)
    odds = float(odds)

    if M < 0:
        raise ValueError("M must be >= 0")
    if not (0 <= n <= M):
        raise ValueError("n must satisfy 0 <= n <= M")
    if not (0 <= N <= M):
        raise ValueError("N must satisfy 0 <= N <= M")
    if not (odds > 0):
        raise ValueError("odds must be > 0")

    return M, n, N, odds


def wallenius_support(M, n, N, odds=None):
    M, n, N, _ = _validate_M_n_N_odds(M, n, N, odds if odds is not None else 1.0)
    x_min = max(0, N - (M - n))
    x_max = min(N, n)
    return x_min, x_max


def sample_wallenius_numpy(M, n, N, odds, *, size=1, rng: np.random.Generator):
    # NumPy-only sampler via sequential biased draws.
    # Returns an integer array with shape `size` (like NumPy/SciPy rvs).
    M, n, N, odds = _validate_M_n_N_odds(M, n, N, odds)

    size = np.atleast_1d(size).astype(int)
    if np.any(size < 0):
        raise ValueError("size must be non-negative")

    if N == 0 or n == 0:
        return np.zeros(size, dtype=int)
    if n == M:
        return np.full(size, N, dtype=int)

    # Remaining counts per Monte Carlo replicate.
    r = np.full(size, n, dtype=int)
    b = np.full(size, M - n, dtype=int)

    # Sequentially draw N times.
    for _ in range(N):
        # Handle edge cases explicitly to avoid division warnings.
        p = np.where(
            r == 0,
            0.0,
            np.where(b == 0, 1.0, (odds * r) / (odds * r + b)),
        )
        u = rng.random(size)
        draw_r = u < p
        r = r - draw_r.astype(int)
        b = b - (~draw_r).astype(int)

    return n - r


def discrete_moments_from_pmf(xs, pmf):
    # Compute mean/var/skew/excess-kurtosis/entropy from a discrete PMF.
    xs = np.asarray(xs, dtype=float)
    pmf = np.asarray(pmf, dtype=float)

    total = pmf.sum()
    if not np.isfinite(total) or total <= 0:
        raise ValueError("PMF must sum to a positive finite value")

    pmf = pmf / total

    mean = float(np.sum(xs * pmf))
    centered = xs - mean
    var = float(np.sum(centered**2 * pmf))

    if var == 0.0:
        skew = 0.0
        exkurt = -3.0
    else:
        mu3 = float(np.sum(centered**3 * pmf))
        mu4 = float(np.sum(centered**4 * pmf))
        skew = mu3 / (var ** 1.5)
        exkurt = mu4 / (var**2) - 3.0

    p = pmf[pmf > 0]
    entropy_nats = float(-np.sum(p * np.log(p)))

    return {
        "mean": mean,
        "var": var,
        "skew": float(skew),
        "exkurt": float(exkurt),
        "entropy_nats": entropy_nats,
    }


def mgf_from_pmf(t, xs, pmf):
    xs = np.asarray(xs, dtype=float)
    pmf = np.asarray(pmf, dtype=float)
    pmf = pmf / pmf.sum()
    return float(np.sum(np.exp(t * xs) * pmf))


def cf_from_pmf(t, xs, pmf):
    xs = np.asarray(xs, dtype=float)
    pmf = np.asarray(pmf, dtype=float)
    pmf = pmf / pmf.sum()
    return complex(np.sum(np.exp(1j * t * xs) * pmf))


def wallenius_mean_field_mean(M, n, N, odds, *, tol=1e-12, max_iter=200):
    # Approximate E[X] by a mean-field ODE argument.
    #
    # This solves for mu in:
    #     (1 - mu/n)^(1/odds) = 1 - (N - mu)/(M - n)
    # within the feasible interval [x_min, x_max].
    M, n, N, odds = _validate_M_n_N_odds(M, n, N, odds)

    x_min, x_max = wallenius_support(M, n, N, odds)

    if N == 0 or n == 0:
        return 0.0
    if n == M:
        return float(N)

    m2 = M - n

    def f(mu):
        mu = float(mu)
        left_base = max(0.0, 1.0 - mu / n)
        right = 1.0 - (N - mu) / m2

        if left_base == 0.0:
            left = 0.0
        else:
            left = math.exp((1.0 / odds) * math.log(left_base))

        return left - right

    lo, hi = float(x_min), float(x_max)
    if lo == hi:
        return lo

    f_lo, f_hi = f(lo), f(hi)

    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi

    # In extreme cases, numerical bracket might not contain a sign change;
    # fall back to an endpoint that matches the direction of bias.
    if f_lo * f_hi > 0:
        return hi if odds > 1 else lo

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if abs(f_mid) < tol or (hi - lo) < tol:
            return mid
        if f_lo * f_mid <= 0:
            hi, f_hi = mid, f_mid
        else:
            lo, f_lo = mid, f_mid

    return 0.5 * (lo + hi)
from scipy.stats import hypergeom, nchypergeom_wallenius

M, n, N, odds = 140, 80, 60, 0.5
x_min, x_max = wallenius_support(M, n, N, odds)
xs = np.arange(x_min, x_max + 1)

pmf = nchypergeom_wallenius.pmf(xs, M, n, N, odds)

# Sanity check: odds=1 reduces to the ordinary hypergeometric distribution.
pmf_unbiased = nchypergeom_wallenius.pmf(xs, M, n, N, 1.0)
pmf_h = hypergeom.pmf(xs, M, n, N)
max_abs_diff_odds1 = float(np.max(np.abs(pmf_unbiased - pmf_h)))

mom_from_pmf = discrete_moments_from_pmf(xs, pmf)
mean_scipy, var_scipy, skew_scipy, exkurt_scipy = nchypergeom_wallenius.stats(
    M, n, N, odds, moments="mvsk"
)

# Monte Carlo moments using the NumPy-only sampler
mc = sample_wallenius_numpy(M, n, N, odds, size=80_000, rng=rng)

{
    "support": [int(x_min), int(x_max)],
    "pmf_sum": float(pmf.sum()),
    "odds1_matches_hypergeom_max_abs_diff": max_abs_diff_odds1,
    "mean_pmf": mom_from_pmf["mean"],
    "mean_scipy": float(mean_scipy),
    "mean_mean_field": wallenius_mean_field_mean(M, n, N, odds),
    "mean_mc": float(mc.mean()),
    "var_pmf": mom_from_pmf["var"],
    "var_scipy": float(var_scipy),
    "var_mc": float(mc.var(ddof=0)),
    "skew_scipy": float(skew_scipy),
    "exkurt_scipy": float(exkurt_scipy),
    "entropy_scipy": float(nchypergeom_wallenius.entropy(M, n, N, odds)),
    "mgf_t=0.1": mgf_from_pmf(0.1, xs, pmf),
    "cf_t=1.0": cf_from_pmf(1.0, xs, pmf),
}
{'support': [0, 60],
 'pmf_sum': 0.9999999999759758,
 'odds1_matches_hypergeom_max_abs_diff': 1.1518563880485999e-14,
 'mean_pmf': 26.649038741000833,
 'mean_scipy': 26.64903874100567,
 'mean_mean_field': 26.666666666660603,
 'mean_mc': 26.6575,
 'var_pmf': 8.208241763338343,
 'var_scipy': 8.208241761190717,
 'var_mc': 8.199268749999998,
 'skew_scipy': 0.035671995651632836,
 'exkurt_scipy': -0.014465480478890846,
 'entropy_scipy': 2.471394837792706,
 'mgf_t=0.1': 14.970488004642451,
 'cf_t=1.0': (0.003122019680244857+0.015529111336782947j)}

5) Parameter Interpretation#

M and n (population composition)#

  • M sets the overall population size.

  • n/M is the baseline fraction of Type I objects.

When N is not tiny relative to M, the fact that we sample without replacement matters: after you draw some Type I objects, there are fewer Type I left, so Type I becomes harder to draw later.

N (number of draws)#

  • Increasing N widens the support (up to a point) and shifts mass upward: the mean increases with N. The variance typically grows at first but shrinks again as N approaches M; at N = M you take every object and X = n with certainty.

odds (bias strength)#

  • odds = 1 means unbiased sampling (hypergeometric).

  • odds > 1 favors Type I, shifting mass to larger x.

  • odds < 1 discourages Type I, shifting mass to smaller x.

In the extreme limits:

  • (\omega\to\infty): you draw as many Type I objects as possible, so (X\to\min(n,N)).

  • (\omega\to 0): you avoid Type I until forced, so (X\to \max(0, N-(M-n))).
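
To see these limits numerically, here is a small Monte Carlo sketch that reuses sample_wallenius_numpy and rng defined above; the extreme odds values are chosen purely for illustration.

# Monte Carlo check of the extreme-odds limits (illustrative values).
M_e, n_e, N_e = 80, 30, 25

x_hi = sample_wallenius_numpy(M_e, n_e, N_e, 1e6, size=2_000, rng=rng)
x_lo = sample_wallenius_numpy(M_e, n_e, N_e, 1e-6, size=2_000, rng=rng)

{
    "mean_at_huge_odds": float(x_hi.mean()),  # should approach min(n, N) = 25
    "mean_at_tiny_odds": float(x_lo.mean()),  # should approach max(0, N - (M - n)) = 0
}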

from plotly.subplots import make_subplots

M, n, N = 80, 30, 25
odds_values = [0.2, 0.5, 1.0, 2.0, 5.0]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        f"PMF vs odds (M={M}, n={n}, N={N})",
        f"PMF vs N (M={M}, n={n}, odds=2.0)",
    ),
)

x_min, x_max = wallenius_support(M, n, N, odds_values[0])
xs = np.arange(x_min, x_max + 1)

for w in odds_values:
    pmf = nchypergeom_wallenius.pmf(xs, M, n, N, w)
    fig.add_trace(
        go.Scatter(x=xs, y=pmf, mode="lines+markers", name=f"odds={w}"),
        row=1,
        col=1,
    )

odds_fixed = 2.0
N_values = [5, 15, 25, 40]
for N_ in N_values:
    x_min, x_max = wallenius_support(M, n, N_, odds_fixed)
    xs_ = np.arange(x_min, x_max + 1)
    pmf_ = nchypergeom_wallenius.pmf(xs_, M, n, N_, odds_fixed)
    fig.add_trace(
        go.Scatter(x=xs_, y=pmf_, mode="lines+markers", name=f"N={N_}"),
        row=1,
        col=2,
    )

fig.update_layout(
    height=420,
    width=980,
    legend=dict(orientation="h", yanchor="bottom", y=-0.25, xanchor="left", x=0),
)
fig.update_xaxes(title_text="x", row=1, col=1)
fig.update_xaxes(title_text="x", row=1, col=2)
fig.update_yaxes(title_text="P(X=x)", row=1, col=1)
fig.update_yaxes(title_text="P(X=x)", row=1, col=2)
fig.show()

6) Derivations#

Wallenius’ distribution is defined by a sequential, state-dependent sampling process, so closed-form algebraic expressions are rare.

We’ll do three things:

  1. derive a useful mean-field approximation for (\mathbb{E}[X])

  2. show how variance follows from the PMF (numerically)

  3. write down the likelihood for (\omega) and fit it numerically

Expectation (mean-field / ODE derivation)#

Let (a(t)) and (b(t)) be the expected numbers of Type I and II remaining after (t) draws. Using the sequential sampling rule,

[ \mathbb{E}[\Delta a] \approx -\frac{\omega a}{\omega a + b}, \qquad \mathbb{E}[\Delta b] \approx -\frac{b}{\omega a + b}. ]

Treating (t) as continuous gives the ODE approximation

[ \frac{da}{dt} = -\frac{\omega a}{\omega a + b},\qquad \frac{db}{dt} = -\frac{b}{\omega a + b}. ]

Take the ratio:

[ \frac{db}{da} = \frac{db/dt}{da/dt} = \frac{b}{\omega a} \quad\Rightarrow\quad \frac{db}{b} = \frac{1}{\omega}\,\frac{da}{a}. ]

Integrating:

[ \log b = \frac{1}{\omega}\log a + C \quad\Rightarrow\quad b = C\,a^{1/\omega}. ]

Use the initial condition (a(0)=n), (b(0)=M-n) to get

[ \frac{b}{M-n} = \left(\frac{a}{n}\right)^{1/\omega}. ]

After (N) draws, (a(N)=n-\mu) and (b(N)=(M-n)-(N-\mu)) where (\mu=\mathbb{E}[X]). Plugging in yields

[ 1-\frac{N-\mu}{M-n} = \left(1-\frac{\mu}{n}\right)^{1/\omega}. ]

This scalar equation has a unique solution in the feasible interval ([x_\ell, x_u]) and can be solved with bisection.
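
This is what the wallenius_mean_field_mean helper defined earlier does. A brief comparison against SciPy's exact mean, with illustrative parameters assumed:

# Compare the mean-field approximation with the exact mean for a few odds values.
M_c, n_c, N_c = 120, 50, 40

for w_c in (0.5, 1.0, 2.0, 5.0):
    approx = wallenius_mean_field_mean(M_c, n_c, N_c, w_c)
    exact = float(nchypergeom_wallenius.mean(M_c, n_c, N_c, w_c))
    print(f"odds={w_c:>4}: mean-field={approx:8.4f}  exact={exact:8.4f}")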

Variance#

Once you can compute the PMF numerically, variance is

[ \mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \sum_x x^2 p(x) - \left(\sum_x x p(x)\right)^2. ]

Likelihood for (\omega)#

If you observe a single count (x), the likelihood for (\omega) (with M,n,N known) is

[ L(\omega\mid x) = p(x\mid M,n,N,\omega),\qquad \ell(\omega)=\log L(\omega\mid x). ]

For independent observations (x_1,\dots,x_m) with the same parameters, the log-likelihood sums:

[ \ell(\omega) = \sum_{i=1}^m \log p(x_i\mid M,n,N,\omega). ]

Because the PMF is evaluated numerically, the MLE (\hat\omega) is typically found with 1D numerical optimization.

from scipy.optimize import minimize_scalar

M, n, N, odds_true = 120, 50, 40, 2.0
x_obs = int(nchypergeom_wallenius.rvs(M, n, N, odds_true, random_state=rng))

odds_grid = np.logspace(-2, 2, 600)
logL = nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds_grid)

# MLE for odds (bounded search on a wide log-scale interval)

def nll(odds):
    return -float(nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds))

res = minimize_scalar(nll, bounds=(1e-3, 1e3), method="bounded")
odds_hat = float(res.x)

fig = go.Figure()
fig.add_trace(go.Scatter(x=odds_grid, y=logL, mode="lines", name="log-likelihood"))
fig.add_vline(x=1.0, line_dash="dash", line_color="gray", annotation_text="odds=1")
fig.add_vline(x=odds_hat, line_dash="dash", line_color="black", annotation_text="MLE")
fig.update_layout(
    title=f"Log-likelihood for odds (x_obs={x_obs}, M={M}, n={n}, N={N})",
    xaxis_title="odds",
    yaxis_title="log L(odds)",
    xaxis_type="log",
)
fig.show()

{
    "x_obs": x_obs,
    "odds_true": odds_true,
    "odds_mle": odds_hat,
    "opt_success": bool(res.success),
    "mean_field_mean_at_true": wallenius_mean_field_mean(M, n, N, odds_true),
}
{'x_obs': 20,
 'odds_true': 2.0,
 'odds_mle': 999.9999700467109,
 'opt_success': True,
 'mean_field_mean_at_true': 22.197826063638786}

7) Sampling & Simulation#

A direct NumPy-only sampling algorithm follows the generative story:

  1. Start with r = n Type I and b = M-n Type II remaining.

  2. Repeat N times:

    • compute (p = \omega r/(\omega r + b))

    • draw (U\sim\text{Uniform}(0,1))

    • if (U<p), record a Type I draw and decrement r; else decrement b

  3. Return the total Type I draws.

This is exact but costs (O(N\cdot\text{size})). For many settings (moderate N), it’s fast and simple.

SciPy’s rvs uses a specialized, optimized implementation (BiasedUrn).

M, n, N, odds = 200, 60, 40, 1.7
size = 50_000

x_np = sample_wallenius_numpy(M, n, N, odds, size=size, rng=rng)
x_sp = nchypergeom_wallenius.rvs(M, n, N, odds, size=size, random_state=rng)

{
    "np_mean": float(x_np.mean()),
    "scipy_mean": float(x_sp.mean()),
    "scipy_theory_mean": float(nchypergeom_wallenius.mean(M, n, N, odds)),
    "np_var": float(x_np.var(ddof=0)),
    "scipy_var": float(x_sp.var(ddof=0)),
}
{'np_mean': 16.27202,
 'scipy_mean': 16.25174,
 'scipy_theory_mean': 16.265095957990887,
 'np_var': 7.4177051196,
 'scipy_var': 7.4803669724}

8) Visualization#

We’ll visualize:

  • the PMF on the integer support

  • the CDF (as a step function)

  • Monte Carlo samples vs the exact PMF

M, n, N, odds = 140, 80, 60, 0.5
x_min, x_max = wallenius_support(M, n, N, odds)
xs = np.arange(x_min, x_max + 1)

pmf = nchypergeom_wallenius.pmf(xs, M, n, N, odds)
cdf = np.cumsum(pmf)

fig_pmf = go.Figure()
fig_pmf.add_trace(go.Bar(x=xs, y=pmf, name="PMF"))
fig_pmf.update_layout(
    title=f"Wallenius PMF (M={M}, n={n}, N={N}, odds={odds})",
    xaxis_title="x",
    yaxis_title="P(X=x)",
)
fig_pmf.show()

fig_cdf = go.Figure()
fig_cdf.add_trace(go.Scatter(x=xs, y=cdf, mode="lines", line_shape="hv", name="CDF"))
fig_cdf.update_layout(
    title=f"Wallenius CDF (M={M}, n={n}, N={N}, odds={odds})",
    xaxis_title="x",
    yaxis_title="P(X≤x)",
)
fig_cdf.show()

# Monte Carlo comparison
mc = sample_wallenius_numpy(M, n, N, odds, size=120_000, rng=rng)
counts = np.bincount(mc - x_min, minlength=(x_max - x_min + 1))
pmf_hat = counts / counts.sum()

fig_mc = go.Figure()
fig_mc.add_trace(go.Bar(x=xs, y=pmf_hat, name="Monte Carlo", opacity=0.6))
fig_mc.add_trace(go.Scatter(x=xs, y=pmf, mode="lines+markers", name="Exact (SciPy)"))
fig_mc.update_layout(
    title="Monte Carlo vs exact PMF",
    xaxis_title="x",
    yaxis_title="Probability",
)
fig_mc.show()

9) SciPy Integration#

SciPy provides scipy.stats.nchypergeom_wallenius:

  • pmf, logpmf

  • cdf, sf (survival function), ppf

  • rvs

  • stats, mean, var, entropy

About .fit#

Many discrete SciPy distributions (including nchypergeom_wallenius) do not implement a generic .fit() method.

In practice, you usually know M,n,N from the study design and only want to estimate the bias parameter odds. We’ll show a simple custom MLE for odds using logpmf.

M, n, N, odds = 140, 80, 60, 0.5
rv = nchypergeom_wallenius(M, n, N, odds)

x_min, x_max = wallenius_support(M, n, N, odds)
xs = np.arange(x_min, x_max + 1)

pmf = rv.pmf(xs)
cdf = rv.cdf(xs)
samples = rv.rvs(size=20_000, random_state=rng)

{
    "pmf_sum": float(pmf.sum()),
    "cdf_last": float(cdf[-1]),
    "sample_mean": float(samples.mean()),
    "theory_mean": float(rv.mean()),
    "theory_var": float(rv.var()),
    "entropy": float(rv.entropy()),
}
{'pmf_sum': 0.9999999999759758,
 'cdf_last': 1.0,
 'sample_mean': 26.67215,
 'theory_mean': 26.64903874100567,
 'theory_var': 8.208241761190717,
 'entropy': 2.471394837792706}
from scipy.optimize import minimize_scalar

# Custom 1D MLE for odds given data and known (M,n,N)
M, n, N, odds_true = 120, 50, 40, 1.8
x = nchypergeom_wallenius.rvs(M, n, N, odds_true, size=2_000, random_state=rng)

# Negative log-likelihood for odds (vectorized over data)

def nll_odds(odds):
    if not (odds > 0):
        return float("inf")
    return -float(nchypergeom_wallenius.logpmf(x, M, n, N, odds).sum())

res = minimize_scalar(nll_odds, bounds=(1e-3, 1e3), method="bounded")

{
    "odds_true": odds_true,
    "odds_mle": float(res.x),
    "opt_success": bool(res.success),
}
{'odds_true': 1.8, 'odds_mle': 999.9999700467109, 'opt_success': True}

10) Statistical Use Cases#

A) Hypothesis testing (bias vs no bias)#

A common question: is selection unbiased? In this setting, “unbiased” corresponds to (\omega=1) and the distribution reduces to the ordinary hypergeometric.

Given an observed (x_\text{obs}), one-sided p-values are computed with the survival function:

[ \text{p-value} = \mathbb{P}(X\ge x_\text{obs}\mid \omega=1) = \mathrm{sf}(x_\text{obs}-1). ]

You can also compare (\omega=1) vs (\omega\ne 1) using a likelihood ratio test (asymptotic (\chi^2_1) calibration).

B) Bayesian modeling (posterior over odds)#

Put a prior on (\log\omega) (e.g., Normal) and compute the posterior on a grid:

[ \log p(\omega\mid x) = \log p(x\mid \omega) + \log p(\omega) + \text{const}. ]

Because the parameter is 1D, grid Bayes is often perfectly adequate.

C) Generative modeling#

In synthetic-data pipelines, Wallenius’ distribution is a handy component when you need to simulate quota-based biased selection without replacement.

from scipy.stats import chi2
from scipy.optimize import minimize_scalar

# Example: test enrichment under the unbiased model (odds=1)
M, n, N = 200, 60, 40
x_obs = 18

p_ge = nchypergeom_wallenius.sf(x_obs - 1, M, n, N, 1.0)

# Likelihood ratio test: H0 odds=1 vs H1 odds free

def nll1(odds):
    return -float(nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds))

res = minimize_scalar(nll1, bounds=(1e-3, 1e3), method="bounded")
ll_hat = -float(res.fun)
ll_null = float(nchypergeom_wallenius.logpmf(x_obs, M, n, N, 1.0))
LR = 2 * (ll_hat - ll_null)
p_lr = float(chi2.sf(LR, df=1))

{
    "x_obs": x_obs,
    "p_value_one_sided_ge": float(p_ge),
    "odds_mle": float(res.x),
    "LR_stat": float(LR),
    "p_value_LR_chi2_approx": p_lr,
}
{'x_obs': 18,
 'p_value_one_sided_ge': 0.018665484494152507,
 'odds_mle': 999.9999700467109,
 'LR_stat': -inf,
 'p_value_LR_chi2_approx': 1.0}
# Simple grid Bayes for odds (prior on log-odds)

M, n, N = 200, 60, 40
x_obs = 18

log_odds_grid = np.linspace(-3.0, 3.0, 600)
odds_grid = np.exp(log_odds_grid)

# Prior: log(odds) ~ Normal(0, 1.0) (up to an additive constant)
log_prior = -0.5 * (log_odds_grid / 1.0) ** 2

log_like = nchypergeom_wallenius.logpmf(x_obs, M, n, N, odds_grid)
log_post_unnorm = log_like + log_prior
log_post_unnorm -= np.max(log_post_unnorm)

post_log = np.exp(log_post_unnorm)
post_log /= np.trapz(post_log, log_odds_grid)  # density w.r.t log-odds

fig = go.Figure()
fig.add_trace(go.Scatter(x=odds_grid, y=post_log, mode="lines", name="posterior"))
fig.add_vline(x=1.0, line_dash="dash", line_color="gray")
fig.update_layout(
    title=f"Posterior over odds (x_obs={x_obs}, M={M}, n={n}, N={N})",
    xaxis_title="odds",
    yaxis_title="density (w.r.t log-odds)",
    xaxis_type="log",
)
fig.show()

# Posterior mean and a central 90% credible interval (in log-odds space)
dlog = log_odds_grid[1] - log_odds_grid[0]
cdf = np.cumsum(post_log) * dlog
cdf = cdf / cdf[-1]

lo_log = float(np.interp(0.05, cdf, log_odds_grid))
hi_log = float(np.interp(0.95, cdf, log_odds_grid))

{
    "posterior_mean_odds": float(np.trapz(np.exp(log_odds_grid) * post_log, log_odds_grid)),
    "credible_interval_90_odds": [float(np.exp(lo_log)), float(np.exp(hi_log))],
}
{'posterior_mean_odds': 2.0282392492127608,
 'credible_interval_90_odds': [1.1560453396623622, 3.184447627708847]}
# Generative modeling: simulate biased selection outcomes for different odds

M, n, N = 150, 50, 35
odds_list = [0.5, 1.0, 2.0]
size = 30_000

rows = []
for w in odds_list:
    x = sample_wallenius_numpy(M, n, N, w, size=size, rng=rng)
    rows.append(
        {
            "odds": w,
            "mean": float(x.mean()),
            "var": float(x.var(ddof=0)),
        }
    )

rows
[{'odds': 0.5, 'mean': 7.433333333333334, 'var': 4.846355555555555},
 {'odds': 1.0, 'mean': 11.633933333333333, 'var': 5.991661862222221},
 {'odds': 2.0, 'mean': 16.652333333333335, 'var': 6.408194555555555}]

11) Pitfalls#

  • Parameter constraints: M, n, N must be integers with (0 \le n \le M) and (0 \le N \le M); odds > 0.

  • Notation clashes: different sources swap the meaning of N, n, M. Always check the API.

  • Degenerate limits:

    • very large odds concentrates mass near (\min(n,N))

    • very small odds concentrates mass near (\max(0, N-(M-n)))

  • Numerical stability: PMF values can be tiny for large populations; use logpmf when optimizing likelihoods.

  • Performance:

    • sequential NumPy sampling is (O(N\cdot\text{size}))

    • PMF evaluation is nontrivial; SciPy relies on the BiasedUrn library for efficiency/accuracy.

  • Identifiability: from counts alone you typically cannot estimate M,n,N,odds simultaneously; in real studies M,n,N are usually known.

12) Summary#

  • nchypergeom_wallenius is a discrete model for biased sequential sampling without replacement from two types.

  • Support is integer and bounded: (x\in\{\max(0,\,N-(M-n)),\dots,\min(N,n)\}).

  • (\omega=1) recovers the ordinary hypergeometric; (\omega>1) favors Type I.

  • PMF/CDF and moments are usually handled numerically (SciPy uses BiasedUrn).

  • Exact sampling is simple to implement via sequential biased draws (NumPy-only), and Monte Carlo checks are straightforward.